Abstract

Data  mining  is  an  emerging  research  area,  whose  goal  is  to  extract  significant  patterns 
or  interesting  rules  from  large  databases.  High-level  inference  from  large  volumes  of 
routine business data can provide valuable information to businesses, such as customer
buying patterns, shelving criteria in supermarkets, and stock trends. Many
algorithms  have  been  proposed  for  data  mining  of  association  rules.  However,  research 
so  far  has  mainly  focused  on  sequential  algorithms. 

In  this  paper  we  present  parallel  algorithms  for  data  mining  of  association  rules, 
and  study  the  degree  of  parallelism,  synchronization,  and  data  locality  issues  on  the 
SGI  Power  Challenge  shared-memory  multi-processor.  We  further  present  a  set  of 
optimizations  for  the  sequential  and  parallel  algorithms.  Experiments  show  that  a 
significant  improvement  of  performance  is  achieved  using  our  proposed  optimizations. 
We  also  achieved  good  speed-up  for  the  parallel  algorithm,  but  we  observe  a  need  for 
parallel  I/O  techniques  for  further  performance  gains. 

Hashing, Shared-Memory Multi-processor

1 Introduction


With large volumes of routine business data having been collected, business
organizations are increasingly turning to the extraction of useful information from such
databases. Such a high-level inference process may provide information on customer
buying patterns, shelving criteria in supermarkets, stock trends, etc. Data mining is
an emerging research area, whose goal is to extract significant patterns or interesting
rules from such large databases.

Data mining is in fact a broad area which combines research in machine learning,
statistics and databases. It can be broadly classified into three main categories
[1]: Classification - finding rules that partition the database into disjoint classes;
Sequences - extracting commonly occurring sequences in temporal data; and
Associations - finding the sets of most commonly occurring groupings of items. In this paper
we will concentrate on data mining for association rules.

The  problem  of  mining  association  rules  over  basket  data  was  introduced  in  [2]. 
Basket  data  usually  consists  of  a  record  per  customer  with  a  transaction  date,  along 
with  items  bought  by  the  customer.  An  example  of  an  association  rule  over  such  a 
database  could  be  that  80%  of  the  customers  that  bought  bread  and  milk,  also  bought 
eggs. The data mining task for association rules can be broken into two steps. The
first step consists of finding all the sets of items, called itemsets, that occur in
the database with a certain user-specified frequency, called minimum support. Such
itemsets are called large itemsets. An itemset of k items is called a k-itemset. The
second step consists of forming implication rules among the large itemsets found in
the first step. The general structure of most algorithms for mining association rules
is that during the initial pass over the database the support for all single items
(1-itemsets) is counted. The large 1-itemsets are used to generate candidate 2-itemsets.
The database is scanned again to obtain occurrence counts for the candidates, and
the large 2-itemsets are selected for the next pass. This iterative process is repeated
for k = 3, 4, ..., until there are no more large k-itemsets to be found.

1.1  Related  Work 

Many algorithms for finding large itemsets have been proposed in the literature since
the introduction of this problem in [2] (AIS algorithm). In [7] a pass minimization
approach was presented, which uses the idea that if an itemset belongs to the set of
large (k + e)-itemsets, then it must contain $\binom{k+e}{k}$ large k-itemsets. The Apriori
algorithm [4] also uses the property that any subset of a large itemset must itself be large.
These algorithms had performance superior to AIS. Newer algorithms with better
performance than Apriori were presented in [8, 10]. The DHP algorithm [8] uses a
hash table in pass k to do efficient pruning of (k+1)-itemsets. The Partition algorithm
[10] minimizes I/O by scanning the database only twice. In the first pass it generates
the set of all potentially large itemsets, and in the second pass the support for all
these is measured. The above algorithms are all specialized black-box techniques
which do not use any database operations. Algorithms using only general-purpose
DBMS systems and relational algebra operations have also been proposed [5, 6].

There has been very limited work on parallel implementations of association
algorithms. In [9], a parallel implementation of the DHP algorithm [8] is presented.
However, only simulation results on a shared-nothing or distributed-memory machine
like the IBM SP2 were presented. Parallel implementations of the Apriori algorithm on
the IBM SP2 were presented in [3]. There has been no study on shared-everything or
shared-memory machines to date. In this paper we present parallel implementations
of the Apriori algorithm on the SGI Power Challenge shared-memory multi-processor.
We study the degree of parallelism, synchronization, and data locality issues in
parallelizing data mining applications for such architectures. We also present a set of
optimizations for the sequential Apriori algorithm, and for the parallel algorithms as
well.

The  rest  of  the  paper  is  organized  as  follows.  In  the  next  section  we  formally 
present  the  problem  of  finding  association  rules,  and  briefly  describe  the  Apriori 
algorithm.  Section  3  presents  a  discussion  of  the  parallelization  issues  for  each  of 
the  steps  in  the  algorithm,  while  section  4  presents  some  effective  optimizations  for 
mining association rules. Section 5 presents our experimental results for the different
optimizations and the parallel performance. Finally, we conclude in section 6.


2  Sequential  Data  Mining 

We  now  present  the  formal  statement  of  the  problem  of  mining  association  rules  over 
basket data. The discussion below closely follows that in [2, 4].

Let $\mathcal{I} = \{i_1, i_2, \ldots, i_m\}$ be a set of m distinct attributes, also called items. Each
transaction T in the database $\mathcal{D}$ of transactions has a unique identifier TID, and
contains a set of items, such that $T \subseteq \mathcal{I}$. An association rule is an expression $A \Rightarrow B$,
where itemsets $A, B \subset \mathcal{I}$, and $A \cap B = \emptyset$. Each itemset is said to have a support
s if s% of the transactions in $\mathcal{D}$ contain the itemset. The association rule is said
to have confidence c if c% of the transactions that contain A also contain B, i.e.,
$c = support(A \cup B)/support(A)$, which is the conditional probability that transactions
contain the itemset B, given that they contain itemset A. For example, we may have
that 80% of the customers that bought bread and milk also bought eggs. The number
80% is the confidence of the rule, while the support of the rule is $support(A \cup B)$.
Data  mining  of  association  rules  from  such  databases  consists  of  finding  the  set  of  all 
such  rules  which  meet  the  user-specified  minimum  confidence  and  support  values. 
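
The two measures can be illustrated with a minimal Python sketch. The basket contents below are hypothetical and were chosen only so that the rule {bread, milk} => {eggs} has the 80% confidence used in the running example; the function names are illustrative and do not come from [2, 4].

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A u B) / support(A) for the rule A => B."""
    a, b = frozenset(antecedent), frozenset(consequent)
    return support(a | b, transactions) / support(a, transactions)


# Hypothetical basket data: 5 baskets contain bread and milk, 4 of those also contain eggs.
baskets = [frozenset(t) for t in (
    {"bread", "milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"eggs"},
    {"milk"},
    {"bread", "eggs"},
)]

rule_support = support({"bread", "milk", "eggs"}, baskets)         # 4 of 8 baskets
rule_confidence = confidence({"bread", "milk"}, {"eggs"}, baskets)  # 4 of 5 baskets
```

Here support is expressed as a fraction rather than a percentage; multiplying by 100 gives the s% and c% of the definitions above.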

The  task  of  data  mining  for  association  rules  can  be  broken  into  two  steps: 

1. Find all the large k-itemsets for k = 1, 2, ....
2. Generate rules from these large itemsets. Given that X is a large k-itemset, for
every non-empty subset $A \subset X$, a rule of the form $A \Rightarrow B$ is generated, where
$B = X - A$, and provided that this rule has the required confidence. We refer
the reader to [4] for more detail on rule generation. Henceforth, we will deal
only with the first step.

In  this  paper  we  will  discuss  the  parallelization  of  the  Apriori  algorithm  on  the 
SGI  Challenge  shared-memory  machine.  We  begin  by  presenting  a  brief  discussion  of 
the  algorithm. 


2.1  The  Apriori  Algorithm 

The naive method of finding large itemsets would be to generate all the $2^m$ subsets
of the universe of m items, count their support by scanning the database, and output
those meeting the minimum support criterion. It is not hard to see that the naive method
exhibits complexity exponential in m, and is quite impractical. Apriori follows the
basic iterative structure discussed earlier. However, the key observation used is that
any subset of a large itemset must also be large. During each iteration of the algorithm
only candidates found to be large in the previous iteration are used to generate a new
candidate set to be counted during the current iteration. A pruning step eliminates
any candidate which has a small subset. The algorithm terminates at step t, if there
are no large t-itemsets. The general structure of the algorithm is given in Figure
1, and a brief discussion of each step is given below (for details on its performance
characteristics, we refer the reader to [4]). In the figure $L_k$ denotes the set of large
k-itemsets, and $C_k$ the set of candidate k-itemsets.


$L_1$ = {large 1-itemsets};
for ($k = 2$; $L_{k-1} \neq \emptyset$; $k$++)
    /* Candidate Itemsets generation */
    $C_k$ = Set of New Candidates;
    /* Support Counting */
    for all transactions $t \in \mathcal{D}$
        for all $k$-subsets $s$ of $t$
            if ($s \in C_k$) $s.count$++;
    /* Large Itemsets generation */
    $L_k$ = {$c \in C_k \mid c.count \geq$ minimum support};
Set of all large itemsets = $\bigcup_k L_k$;


Figure  1:  The  Apriori  Algorithm 

• Candidate Itemsets Generation: The candidates $C_k$ for the k-th pass are
generated by joining $L_{k-1}$ with itself, which can be expressed as:

$C_k = \{x \mid x[1:k-2] = A[1:k-2] = B[1:k-2],\ x[k-1] = A[k-1],\ x[k] = B[k-1],\ A[k-1] < B[k-1],\ \text{where } A, B \in L_{k-1}\}$

where x[a : b] denotes items at index a through b in itemset x. Before inserting
x into $C_k$, we test whether all (k - 1)-subsets of x are large. This pruning step
eliminates any itemset at least one of whose subsets is not large. The candidates
are stored in a hash tree to facilitate fast support counting. An interior node
of the hash tree at depth d contains a hash table whose cells point to nodes at
depth d + 1. The size of the hash table, also called the fan-out, is denoted as $\mathcal{F}$.
All the itemsets are stored in the leaves. To insert an itemset in $C_k$, we start at
the root, and at depth d we hash on the d-th item in the itemset until we reach
a leaf. If the number of itemsets in that leaf exceeds a threshold value, that
node is converted into an internal node. We would generally like the fan-out to
be large, and the threshold to be small, to facilitate fast support counting. The
maximum depth of the tree in iteration k is k.

• Support Counting: To count the support of candidate k-itemsets, for each
transaction T in the database, we form all k-subsets of T in lexicographical
order. This is done by starting at the root and hashing on items 0 through
(n - k + 1) of the transaction. If we reach depth d by hashing on item i then
we hash on items i through (n - k + 1) + d. This is done recursively, until we
reach a leaf. At this point we increment the count of all itemsets in the leaf
that are contained in the transaction (note: this is the reason for having a small
threshold value).

• Large Itemsets Generation: Each itemset in $C_k$ with minimum support is
inserted into $L_k$, a sorted linked-list denoting the large k-itemsets. $L_k$ is used
in generating the candidates in the next iteration.
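
The hash tree used in the steps above can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not taken from the paper: nodes are plain dictionaries, items are integers hashed with `item % fanout`, and support counting enumerates the k-subsets of a transaction with `itertools.combinations` instead of the recursive hashing described above (the resulting counts are the same, but the traversal is less efficient).

```python
from itertools import combinations


class HashTree:
    """Candidate hash tree: interior nodes hash on the d-th item, leaves store itemsets."""

    def __init__(self, k, fanout=3, threshold=2):
        self.k, self.fanout, self.threshold = k, fanout, threshold
        self.root = {"leaf": True, "items": {}}   # items: itemset -> support count

    def _child(self, node, item):
        # Assumed hash function: item modulo the fan-out.
        return node["children"][item % self.fanout]

    def _find_leaf(self, itemset):
        node, depth = self.root, 0
        while not node["leaf"]:
            node = self._child(node, itemset[depth])
            depth += 1
        return node, depth

    def insert(self, itemset):
        node, depth = self._find_leaf(itemset)
        node["items"][itemset] = 0
        # Convert an overfull leaf into an interior node; depth k is the maximum.
        if len(node["items"]) > self.threshold and depth < self.k:
            self._split(node, depth)

    def _split(self, node, depth):
        old = node.pop("items")
        node["leaf"] = False
        node["children"] = [{"leaf": True, "items": {}} for _ in range(self.fanout)]
        for itemset, count in old.items():
            self._child(node, itemset[depth])["items"][itemset] = count
        for child in node["children"]:
            if len(child["items"]) > self.threshold and depth + 1 < self.k:
                self._split(child, depth + 1)

    def count(self, transaction):
        # Simplification: look up every k-subset of the transaction.
        for subset in combinations(sorted(transaction), self.k):
            leaf, _ = self._find_leaf(subset)
            if subset in leaf["items"]:
                leaf["items"][subset] += 1

    def counts(self):
        out, stack = {}, [self.root]
        while stack:
            node = stack.pop()
            if node["leaf"]:
                out.update(node["items"])
            else:
                stack.extend(node["children"])
        return out


tree = HashTree(k=2)
for candidate in [(1, 2), (1, 4), (1, 5), (2, 4), (2, 5), (4, 5)]:
    tree.insert(candidate)
for transaction in [(1, 4, 5), (1, 2), (3, 4, 5), (1, 2, 4, 5)]:
    tree.count(transaction)
supports = tree.counts()
```

With a small threshold each leaf holds few itemsets, so a lookup compares a subset against few stored candidates, which is the reason given above for keeping the threshold small.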

We now present a simple example of how Apriori works. Let the database $\mathcal{D}$ =
{$T_1$ = (1, 4, 5), $T_2$ = (1, 2), $T_3$ = (3, 4, 5), $T_4$ = (1, 2, 4, 5)}. Let the minimum support
value MS = 2. Running through the iterations, we get

$L_1$ = {{1}, {2}, {4}, {5}}

$C_2$ = {{1,2}, {1,4}, {1,5}, {2,4}, {2,5}, {4,5}}

$L_2$ = {{1,2}, {1,4}, {1,5}, {4,5}}

$C_3$ = {{1,4,5}}

$L_3$ = {{1,4,5}}

Note that while forming $C_3$ by joining $L_2$ with itself, we get three potential candidates,
{1,2,4}, {1,2,5}, and {1,4,5}. However, only {1,4,5} is a true candidate, and the
first two are eliminated in the pruning step, since they have a 2-subset which is not
large (the 2-subsets {2,4} and {2,5}, respectively).
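
The worked example can be reproduced with a compact Python sketch of the iterative structure in Figure 1. It is a minimal illustration, not the implementation evaluated in this paper: it keeps candidates in plain Python sets rather than a hash tree, uses absolute support counts, and the function name `apriori` is chosen here for convenience.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {k: set of large k-itemsets}, with itemsets as sorted tuples."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {1: {s for s, c in counts.items() if c >= min_support}}

    k = 2
    while large[k - 1]:
        prev = sorted(large[k - 1])
        candidates = set()
        # Join: pair itemsets of L_{k-1} that agree on their first k-2 items.
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                x = a + (b[-1],)
                # Prune: every (k-1)-subset of x must itself be large.
                if all(sub in large[k - 1] for sub in combinations(x, k - 1)):
                    candidates.add(x)
        # Support counting over the whole database.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for subset in combinations(sorted(t), k):
                if subset in counts:
                    counts[subset] += 1
        large[k] = {s for s, c in counts.items() if c >= min_support}
        k += 1

    del large[k - 1]   # the last level is empty by construction
    return large


database = [(1, 4, 5), (1, 2), (3, 4, 5), (1, 2, 4, 5)]
result = apriori(database, min_support=2)
```

On this database the sketch yields the same $L_1$, $L_2$, and $L_3$ as listed in the example, and terminates in the fourth iteration because no candidate 4-itemsets can be formed.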


3  Parallel  Data  Mining:  Design  Issues 

In  this  section  we  present  the  design  issues  in  parallel  data  mining  for  association 
rules  on  shared  memory  architectures.  We  separately  look  at  each  of  the  three  steps 
-  candidate  generation,  support  counting,  and  large  itemsets  generation. 

3.1  Candidate  Itemsets  Generation 

Optimized  Join  and  Pruning 

Recall that in iteration k, $C_k$ is generated by joining $L_{k-1}$ with itself. For example,
let $L_2$ = {AB, AC, AD, AE, BE, BF, BG, DE}. The naive way of doing the join is to
look at all $\binom{8}{2} = 28$ combinations. However, since $L_{k-1}$ is lexicographically sorted, we
partition the itemsets in $L_{k-1}$ into equivalence classes based on their common k - 2
length prefixes, i.e., the equivalence class of $a \in L_{k-2}$ is given as

$[a] = \{b \in L_{k-1} \mid a[1:k-2] = b[1:k-2]\}$

The partition of $L_2$ above would give us the equivalence classes: $S_1$ = [A] = {AB, AC,
AD, AE}, $S_2$ = [B] = {BE, BF, BG}, and $S_3$ = [D] = {DE}. Let $\mathcal{I}([X]) =
\{b[k-1] \mid b \in [X]\}$ denote the items at the (k - 1)-st position. Then $\mathcal{I}([A])$ =
{B, C, D, E}, $\mathcal{I}([B])$ = {E, F, G}, and $\mathcal{I}([D])$ = {E}. k-itemsets are formed only
from items within a class by taking all $\binom{|S_i|}{2}$ item pairs and prefixing them with the
class identifier. For example, consider the items in $\mathcal{I}([B])$. By choosing all pairs, we
get EF, EG, and FG, and after prefixing the class identifier we obtain the candidates
BEF, BEG, and BFG. (Note: any equivalence class with a single element (e.g., $S_3$)
cannot produce a new itemset. Thus such classes can be eliminated from the
generation step. However, we must still retain them for the pruning step.) We thus only
have to look at $\binom{4}{2} + \binom{3}{2} = 6 + 3 = 9$ combinations. In general we have to consider
$\sum_{i=0}^{n} \binom{|S_i|}{2}$ combinations instead of $\binom{|L_{k-1}|}{2}$ combinations.
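
The saving from the equivalence-class join can be checked with a short Python sketch. It assumes itemsets are represented as strings of single-letter items, and it performs only the join, not the pruning step; the function name is illustrative.

```python
from itertools import combinations, groupby


def join_by_class(large_prev):
    """Join L_{k-1} with itself, pairing only itemsets that share a (k-2)-prefix."""
    candidates, pairs_examined = [], 0
    # Sorting makes each equivalence class a contiguous run with the same prefix.
    for prefix, group in groupby(sorted(large_prev), key=lambda s: s[:-1]):
        tails = [s[-1] for s in group]            # the items at position k-1
        for x, y in combinations(tails, 2):
            pairs_examined += 1
            candidates.append(prefix + x + y)
    return candidates, pairs_examined


L2 = ["AB", "AC", "AD", "AE", "BE", "BF", "BG", "DE"]
candidates, examined = join_by_class(L2)
naive_pairs = len(L2) * (len(L2) - 1) // 2
```

For this $L_2$ the class-based join examines 9 pairs, against 28 for the naive join, and the class [B] contributes exactly BEF, BEG, and BFG.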

While pruning a candidate we have to check if all k of its (k - 1)-subsets are large.
Since the candidate is formed by an item pair from the same class, we need only check
for the remaining k - 2 subsets. Furthermore, assuming all $S_i$ are lexicographically
sorted, these k - 2 subsets must come from classes greater than the current class.
Thus, to generate a candidate, there must be at least k - 2 equivalence classes after
a given class. In other words we need consider only the first n - (k - 2) equivalence
classes. We can even further refine the estimate of how many classes follow a given
class by considering only those classes with elements greater than any of the item
pairs. For example, there is no need to form 3-itemsets from $S_2$. Consider the result
of the join of items E and F, which produces EF. Although there is a class after [B],
namely [D], since it is lexicographically smaller than EF we can safely reject $S_2$ from
further consideration.


Adaptive Hash Table Size ($\mathcal{F}$): Having equivalence classes also allows us to
accurately adapt the hash table size for each iteration. For iteration k, and for a
given threshold value T, i.e., the maximum number of k-itemsets per leaf, the total
k-itemsets that can be inserted into the tree is given by the expression $T\mathcal{F}^k$. Since
we can insert up to $\sum_{i=0}^{n} \binom{|S_i|}{2}$ itemsets, we get the expression $T\mathcal{F}^k \geq \sum_{i=0}^{n} \binom{|S_i|}{2}$.
This can be solved to obtain:

$\mathcal{F} \geq \left( \frac{\sum_{i=0}^{n} \binom{|S_i|}{2}}{T} \right)^{1/k}$
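
As an illustration, the smallest integer fan-out satisfying $T\mathcal{F}^k \geq \sum_i \binom{|S_i|}{2}$ can be found with a few lines of Python. The function name and the example class sizes are assumptions made for this sketch; the search uses integer arithmetic to avoid rounding issues with fractional roots.

```python
from math import comb


def min_fanout(class_sizes, threshold, k):
    """Smallest integer fan-out F with threshold * F**k >= number of candidates."""
    total = sum(comb(size, 2) for size in class_sizes)
    fanout = 1
    while threshold * fanout ** k < total:
        fanout += 1
    return fanout


# Hypothetical equivalence classes of sizes 4, 3 and 1 give 6 + 3 + 0 = 9 candidates.
f_small_threshold = min_fanout([4, 3, 1], threshold=1, k=2)   # needs F**2 >= 9
f_larger_threshold = min_fanout([4, 3, 1], threshold=2, k=3)  # needs 2 * F**3 >= 9
```

A larger threshold or a deeper tree (larger k) lets a smaller fan-out accommodate the same number of candidates.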


Computation  Balancing 

Let the number of processors P = 3, k = 2, and $L_1$ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. There
is only 1 resulting equivalence class since the k - 2 (= 0) length common prefix is
null. The number of 2-itemsets generated by an itemset, called the work load due
to itemset i, is given as $w_i = n - i - 1$, for i = 0, ..., n - 1. For example, itemset 0
contributes nine 2-itemsets {01, 02, 03, 04, 05, 06, 07, 08, 09}. There are different ways
of partitioning this class among P processors.

Block partitioning: A simple block partition generates the assignment $A_0$ =
{0, 1, 2}, $A_1$ = {3, 4, 5} and $A_2$ = {6, 7, 8, 9}, where $A_p$ denotes the itemsets assigned
to processor p. The resulting workload per processor is: $W_0$ = 9 + 8 + 7 = 24,
$W_1$ = 6 + 5 + 4 = 15 and $W_2$ = 3 + 2 + 1 = 6, where $W_p = \sum_{i \in A_p} w_i$. We can
clearly see that this method suffers from a load imbalance problem.

Interleaved partitioning: A better way is to do an interleaved partition, which
results in the assignment $A_0$ = {0, 3, 6, 9}, $A_1$ = {1, 4, 7} and $A_2$ = {2, 5, 8}. The work load
is now given as $W_0$ = 9 + 6 + 3 = 18, $W_1$ = 8 + 5 + 2 = 15 and $W_2$ = 7 + 4 + 1 = 12.
The load imbalance is much smaller; however, it is still present.


Bitonic Partitioning (Single Equivalence Class): We propose a new partitioning
scheme, called bitonic partitioning. This scheme is based on the observation that
the sum of the workload due to itemsets i and (2P - i - 1) is a constant:

$w_i + w_{2P-i-1} = n - i - 1 + (n - (2P - i - 1) - 1) = 2n - 2P - 1$

We can therefore assign itemsets i and (2P - i - 1) as one unit with uniform work
(2n - 2P - 1).


Figure  2:  Bitonic  Partitioning 

Let r = n mod (2P). If r ≠ 0, then there is an imbalance caused by the last r
itemsets. In this case n = 2Ps + r, where s is a positive integer. Let z = 2Ps. The
first z itemsets can be assigned to each processor as follows:

$A_p = \{y \mid y = p + j(2P) \text{ or } y = 2P - p - 1 + j(2P),\ j = 0, \ldots, s-1\}$  (1)

The remaining r itemsets are assigned as:

$A_p = \{y < n \mid y = z + p \text{ or } y = z + 2P - p - 1\}$  (2)

Going back to our example, we first compute r = 10 mod 6 = 4. The assignments
to processors are given as: $A_0$ = {0, 5}, $A_1$ = {1, 4}, and $A_2$ = {2, 3}. We call
this approach bitonic partitioning, since the itemsets are assigned to processors in an
increasing and decreasing fashion, i.e., we assign itemsets 0, 1, 2 to the 3 processors
followed by assigning itemsets 5, 4, 3 (instead of 3, 4, 5, as in the interleaved case). The
remaining 4 itemsets are assigned using equation 2. The final assignment is given as
$A_0$ = {0, 5, 6}, $A_1$ = {1, 4, 7}, and $A_2$ = {2, 3, 8, 9}, with corresponding workload
given as $W_0$ = 9 + 4 + 3 = 16, $W_1$ = 8 + 5 + 2 = 15 and $W_2$ = 7 + 6 + 1 = 14.
This partition scheme is better than the interleaved scheme and results in almost
no imbalance. Figure 2 illustrates the difference between the interleaved and bitonic
partitioning schemes.
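
The three schemes can be compared with a small Python sketch, using the work load $w_i = n - i - 1$ from above. The function names are illustrative, and placing the leftover itemsets of the block partition on the last processor is an assumption chosen to match the example.

```python
def workloads(assignment, n):
    """Total work per processor, with w_i = n - i - 1 for itemset i."""
    return [sum(n - i - 1 for i in items) for items in assignment]


def block_partition(n, p):
    size = n // p
    parts = [list(range(q * size, (q + 1) * size)) for q in range(p)]
    parts[-1].extend(range(p * size, n))   # leftovers go to the last processor
    return parts


def interleaved_partition(n, p):
    return [list(range(q, n, p)) for q in range(p)]


def bitonic_partition(n, p):
    parts = [[] for _ in range(p)]
    for i in range(n):
        m = i % (2 * p)
        # Ascending over the first p positions of each period, descending over the rest.
        parts[m if m < p else 2 * p - 1 - m].append(i)
    return parts


n, p = 10, 3
block = workloads(block_partition(n, p), n)
interleaved = workloads(interleaved_partition(n, p), n)
bitonic = workloads(bitonic_partition(n, p), n)
```

For n = 10 and P = 3 the spread between the most and least loaded processor shrinks from 18 (block) to 6 (interleaved) to 2 (bitonic).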

Bitonic partitioning (Multiple Equivalence Classes): Above we presented the
simple case of $C_2$, where we only had a single equivalence class. In general we may
have multiple equivalence classes. Observe that the bitonic scheme presented above
is a greedy algorithm, i.e., we sort all the $w_i$ (the work load due to itemset i), extract
the itemset with maximum $w_i$, and assign it to processor 0. Each time we extract the
maximum of the remaining itemsets and assign it to the least loaded processor. This
greedy strategy generalizes to the case of multiple equivalence classes as well. For a single
equivalence class all the work loads are unique. However, for multiple classes all the values
may not be distinct. For example, let $L_2$ = {AB, AC, AD, AE, BC, BD, BE, DE}.
The sorted list based on the work load due to each itemset (ignoring those with no
work load) is given as: {$w_{AB}$ = 3, $w_{AC}$ = 2, $w_{BC}$ = 2, $w_{AD}$ = 1, $w_{BD}$ = 1}. The
resulting assignment is $A_0$ = {AB}, $A_1$ = {AC, AD}, and $A_2$ = {BC, BD}. The total
work is given as: $W_0$ = 3, $W_1$ = 2 + 1 = 3 and $W_2$ = 2 + 1 = 3.
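
The greedy strategy can be sketched in Python with a priority queue of processor loads. This is an illustration only: the helper names are hypothetical, itemsets are strings of single-letter items, and ties between equally loaded processors are broken by the lower processor index, which is an assumption.

```python
import heapq
from itertools import groupby


def class_workloads(large_prev):
    """Work per itemset: the number of later itemsets in its own equivalence class."""
    work = {}
    for _, group in groupby(sorted(large_prev), key=lambda s: s[:-1]):
        members = list(group)
        for position, itemset in enumerate(members):
            w = len(members) - 1 - position
            if w > 0:                      # ignore itemsets with no work load
                work[itemset] = w
    return work


def greedy_assign(work, p):
    """Heaviest itemset first, always onto the currently least loaded processor."""
    heap = [(0, q) for q in range(p)]      # (current load, processor id)
    assignment = [[] for _ in range(p)]
    # sorted() is stable, so equal work loads keep their lexicographic order.
    for itemset, w in sorted(work.items(), key=lambda kv: -kv[1]):
        load, q = heapq.heappop(heap)
        assignment[q].append(itemset)
        heapq.heappush(heap, (load + w, q))
    loads = [0] * p
    for load, q in heap:
        loads[q] = load
    return assignment, loads


L2 = ["AB", "AC", "AD", "AE", "BC", "BD", "BE", "DE"]
assignment, loads = greedy_assign(class_workloads(L2), p=3)
```

Applied to the single class $L_1$ = {0, ..., 9} with $w_i = n - i - 1$, the same routine yields the bitonic assignment {0, 5, 6}, {1, 4, 7}, {2, 3, 8, 9}, apart from itemset 9, which carries no work and is skipped here.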


Adaptive  Parallelism 

Let m be the total number of items in the database. Then there are potentially $\binom{m}{k}$
large k-itemsets that we would have to count during iteration k (however, fortunately
there are usually only a small number of large itemsets). $\binom{m}{k}$ is maximum at the
median value of k, after which it starts decreasing (for example, for $\binom{5}{k}$ we have the
values 1, 5, 10, 10, 5, 1 for successive values of k). This suggests a need for some form
of dynamic or adaptive parallelization based on the number of large k-itemsets. If
there aren't a sufficient number of large itemsets, then it is better not to parallelize
the candidate generation.

Parallel  Hash  Tree  Formation 

We  could  choose  to  build  the  candidate  hash  tree  in  parallel,  or  we  could  let  the 
candidates  be  temporarily  inserted  in  local  lists  (or  hash  trees).  This  would  have  to 
be  followed  by  a  step  to  construct  the  global  hash  tree. 

In our implementation we build the tree in parallel. We associate a lock with each
leaf node in the hash tree. When processor i wants to insert a candidate itemset into
the hash tree it starts at the root node and hashes on successive items in the itemset
until it reaches a leaf node. At this point it acquires a lock on this leaf node for
mutual exclusion while inserting the itemset. However, if we exceed the threshold of
the leaf, we convert the leaf into an internal node (with the lock still set). This implies
that we also have to provide a lock for all the internal nodes, and each processor will
have to check whether any node along its downward path from the root is locked. This
complication only arises at the interface of the leaves and internal nodes.

With this locking mechanism, each process can insert the itemsets in different
parts of the hash tree in parallel. However, since we start with a hash tree with the
root as a leaf, there can be a lot of initial contention to acquire the lock at the root.
In practice, we did not find this to be a significant factor on 12 processors.


3.2  Support  Counting 

For  this  phase,  we  could  either  split  the  database  logically  among  the  processors  with 
a  common  hash  tree,  or  split  the  hash  tree  with  each  processor  traversing  the  entire 
database.  We  will  look  at  each  case  below. 


Partitioned  vs.  Common  Candidate  Hash  Tree 

One  approach  in  parallelizing  the  support  counting  step  is  to  split  the  hash  tree 
among  the  processors.  The  decisions  for  computation  balancing  directly  influence 
the  effectiveness  of  this  approach,  since  each  processor  should  ideally  have  the  same 
number of itemsets in its local portion of the hash tree. Another approach is to
keep  a  single  common  hash  tree  among  all  the  processors.  There  are  several  ways  of 
incrementing  the  count  of  itemsets  in  the  common  candidate  hash  tree. 


Counter per Itemset: Let us assume that each itemset in the candidate hash tree has
a single count field associated with it. Since the counts are common, more than one
processor may try to access the count field and increment it. We thus need a locking
mechanism to provide mutual exclusion among the processors while incrementing the
count. This approach may cause contention and degrade the performance. However,
since we are using only 12 processors, and the sharing is very fine-grained (at the
itemset level), we found this to be the best approach.¹
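
The counter-per-itemset scheme can be sketched with Python threads, one lock per candidate. This is only meant to show the locking structure under stated assumptions: candidates live in a dictionary rather than a hash tree, the database is split into contiguous blocks, and Python threads will not show the speed-up of a real shared-memory implementation.

```python
import threading
from itertools import combinations


def count_support_shared(candidates, database, k, n_workers=3):
    """Count candidate supports with one shared counter (and one lock) per itemset."""
    counts = dict.fromkeys(candidates, 0)
    locks = {c: threading.Lock() for c in candidates}

    def worker(block):
        for transaction in block:
            for subset in combinations(sorted(transaction), k):
                if subset in counts:
                    with locks[subset]:          # mutual exclusion on this itemset only
                        counts[subset] += 1

    # Blocked logical partition of the database, one block per worker.
    size = max(1, -(-len(database) // n_workers))   # ceiling division
    threads = [threading.Thread(target=worker, args=(database[i:i + size],))
               for i in range(0, len(database), size)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return counts


database = [(1, 4, 5), (1, 2), (3, 4, 5), (1, 2, 4, 5)]
C2 = [(1, 2), (1, 4), (1, 5), (2, 4), (2, 5), (4, 5)]
shared_counts = count_support_shared(C2, database, k=2)
```

Because each lock guards a single itemset, two workers only contend when they update the same candidate at the same moment, which is the fine-grained sharing referred to above.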


Partitioned  vs.  Common  Database 

We  could  either  choose  to  logically  partition  the  database  among  the  processors, 
or  each  processor  can  choose  to  traverse  the  entire  database  for  incrementing  the 
candidate  support  counts. 


¹ We also implemented two other versions described below, but we did not get any improvement in
performance. 1) Separate Counters: we associate different count fields with each itemset, one per
processor. This would obviate the need for synchronization. However, it may cause false sharing,
since the counters would probably share the same cache line, causing the line to be exchanged
between the processors even though the processors are accessing different counters. A solution is
to pad the counters to the cache line. This eliminates false sharing, but it consumes more memory.
2) Local Counter Array: The other approach is to keep a $\|C_k\|$ length local counter array per
processor, with an entry for each itemset. This avoids both contention and false sharing. We would
expect this to work well for a large number of processors. However, any gain in the support counting
step will be offset by the extra reduction step at the end.

Balanced Database Partitioning: In our implementation we partition the database
in a blocked fashion among all the processors. However, this strategy may not result
in balanced work per processor. This is because the work load is a function of the
length of the transactions. If $l_t$ is the length of the transaction t, then during iteration
k of the algorithm, we have to test whether all the $\binom{l_t}{k}$ subsets of the transaction are
contained in $C_k$. Clearly the complexity of the work load for a transaction is given as
$O(\min(l_t^k, l_t^{l_t-k}))$, i.e., it is polynomial in the transaction length. This also implies
that a static partitioning won't work. However, we could devise static heuristics to
approximate a balanced partition. For example, one static heuristic is to estimate
the maximum number of iterations we expect, say T. We could then partition the
database based on the mean estimated work load for each transaction over all
iterations, given as $(\sum_{i=1}^{T} \binom{l_t}{i})/T$. Another approach is to re-partition the database in
each iteration. In this case it is important to respect the locality of the partition
by moving transactions only when it is absolutely necessary. We plan to investigate
different partitioning schemes as part of future work.
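
One possible static heuristic of this kind is sketched below in Python. It estimates the work for a transaction of length $l_t$ as the mean number of i-subsets over an assumed maximum of T iterations, and then cuts the database into contiguous blocks of roughly equal estimated work. Both the cutting rule and the function names are assumptions made for illustration; this is not a scheme evaluated in this paper.

```python
from math import comb


def mean_workload(length, max_iters):
    """Mean number of i-subsets of a transaction of the given length, i = 1..max_iters."""
    return sum(comb(length, i) for i in range(1, max_iters + 1)) / max_iters


def balanced_blocks(lengths, p, max_iters):
    """Split transaction indices into at most p contiguous blocks of similar estimated work."""
    work = [mean_workload(length, max_iters) for length in lengths]
    target = sum(work) / p
    blocks, current, accumulated = [], [], 0.0
    for index, w in enumerate(work):
        current.append(index)
        accumulated += w
        # Close the block once the running total passes the next multiple of the target.
        if accumulated >= target * (len(blocks) + 1) and len(blocks) < p - 1:
            blocks.append(current)
            current = []
    blocks.append(current)
    return blocks


# Hypothetical transaction lengths: four short transactions followed by two long ones.
lengths = [2, 2, 2, 2, 6, 6]
blocks = balanced_blocks(lengths, p=2, max_iters=3)
```

For these lengths an equal-count split {0, 1, 2}, {3, 4, 5} would give estimated loads of 3 and about 28.3, whereas the work-based split above gives about 17.7 and 13.7.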


3.3  Parallel  Data  Mining:  Algorithms 

Based  on  the  discussion  in  the  previous  section,  we  consider  the  following  algorithms 
for  mining  association  rules  in  parallel: 

• Common Candidate Partitioned Database (CCPD): This algorithm uses
a common candidate hash tree across all processors, while the database is
logically split among them. The hash tree is built in parallel (see section 3.1). Each
processor then traverses its local database and counts the support (see section
3.2) for each itemset. Finally, the master process selects the large itemsets.

• Partitioned Candidate Common Database (PCCD): This has a
partitioned candidate hash tree, but a common database. In this approach we
construct a local candidate hash tree per processor. Each processor then
traverses the entire database and counts support for itemsets only in its local tree.
Finally, the master process performs the reduction and selects the large itemsets
for the next iteration.

Note that the common candidate common database (CCCD) approach results in
duplicated work, while the partitioned candidate partitioned database (PCPD)
approach is more or less equivalent to CCPD. For this reason we did not implement
these parallelizations.


4  Optimizations 

In  this  section  we  present  some  optimizations  to  the  association  rule  algorithm.  These 
optimizations  are  beneficial  for  both  sequential  and  parallel  implementation. 

4.1 Hash Tree Balancing


Although  the  computation  balancing  approach  results  in  balanced  work  load,  it  does 
not  guarantee  that  the  resulting  hash  tree  is  balanced. 

Balancing $C_2$ (No Pruning): We begin with a discussion of tree balancing for
$C_2$, since there is no pruning step in this case. We can balance the hash tree by using
the bitonic partitioning scheme described above. We simply replace P, the number
of processors, with the fan-out $\mathcal{F}$ for the hash table. We label the n large 1-itemsets
from 0 to n - 1 in lexicographical order, and use P = $\mathcal{F}$ to derive the assignments
$A_0, \ldots, A_{\mathcal{F}-1}$ for each processor. Each $A_i$ is treated as an equivalence class. The hash
function is based on these equivalence classes, i.e., an item hashes to the index of
the class that contains it. The equivalence classes are implemented via an indirection vector
of length n. For example, let $L_1$ = {A, D, E, G, K, M, N, S, T, Z}. We first label
these as {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. Assume that the fan-out $\mathcal{F}$ = 3. We thus obtain
the 3 equivalence classes $A_0$ = {0, 5, 6}, $A_1$ = {1, 4, 7}, and $A_2$ = {2, 3, 8, 9}, and
the indirection vector is shown in Table 1. Furthermore, this hash function is applied
at all levels of the hash tree. Clearly, this scheme results in a balanced hash tree
as compared to the simple g(i) = i mod $\mathcal{F}$ hash function (which corresponds to the
interleaved partitioning scheme from section 3.1).


Label        0   1   2   3   4   5   6   7   8   9
Hash Value   0   1   2   2   1   0   0   1   2   2

Table 1: Indirection Vector
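
The indirection vector of Table 1 can be reproduced with a few lines of code. The sketch below assumes a closed form for the bitonic hash, h(i) = i mod F when (i mod 2F) < F and 2F - 1 - (i mod 2F) otherwise (compare the definition in Theorem 1); the function names are illustrative.

```python
def bitonic_hash(i, fanout):
    """Map item label i to an equivalence class in 0..fanout-1."""
    r = i % (2 * fanout)
    # The first half of each block of 2*fanout labels counts up,
    # the second half counts back down.
    return r if r < fanout else 2 * fanout - 1 - r

def indirection_vector(n, fanout):
    """Hash value for every label 0..n-1, looked up instead of recomputed."""
    return [bitonic_hash(i, fanout) for i in range(n)]

def equivalence_classes(n, fanout):
    """Group the labels 0..n-1 by their hash value."""
    classes = [[] for _ in range(fanout)]
    for label, value in enumerate(indirection_vector(n, fanout)):
        classes[value].append(label)
    return classes

vector = indirection_vector(10, 3)    # [0, 1, 2, 2, 1, 0, 0, 1, 2, 2], as in Table 1
classes = equivalence_classes(10, 3)  # [[0, 5, 6], [1, 4, 7], [2, 3, 8, 9]]
```

Storing the result as a vector means that hashing an item during tree traversal is a single array lookup.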


Balancing $C_k$ ($k > 2$): Although items can be pruned for iteration $k \geq 3$, we use
the same bitonic partitioning scheme for $C_3$ and beyond. Below we show that even
in this general case the bitonic hash function is very good as compared to the interleaved
scheme. Theorem 1 below establishes an upper and lower bound on the number of
itemsets per leaf for the bitonic scheme.

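Theorem 1 Let $k > 1$ denote the iteration number, $I = \{0, \ldots, d-1\}$ the set of
items, $\mathcal{F}$ the fan-out of the hash table, $\mathcal{T} = \{0, \ldots, \mathcal{F}-1\}$ the set of equivalence
classes modulo $\mathcal{F}$, $T = \mathcal{F}^k$ the total number of leaves in $C_k$, and $\mathcal{G}$ the family of all
size $k$ ordered subsets of $I$, i.e., the set of all $k$-itemsets that can be constructed from
items in $I$. Suppose $d/2\mathcal{F}$ is an integer and $d/2\mathcal{F}, \mathcal{F} > k$. Define the bitonic hash function
$h : I \to \mathcal{T}$ by:

\[ h(i) = \begin{cases} i \bmod \mathcal{F} & \text{if } 0 \leq (i \bmod 2\mathcal{F}) < \mathcal{F}, \\ 2\mathcal{F} - 1 - (i \bmod 2\mathcal{F}) & \text{otherwise,} \end{cases} \]

and the mapping $H : \mathcal{G} \to \mathcal{T}^k$ from $k$-itemsets to the leaves of $C_k$ by
$H(a_1, \ldots, a_k) = (h(a_1), \ldots, h(a_k))$. Then for every leaf $B = (b_1, \ldots, b_k) \in \mathcal{T}^k$,
the ratio of the number of $k$-itemsets in the leaf ($\|H^{-1}(B)\|$) to the average number of
itemsets per leaf ($\|\mathcal{G}\|/\|\mathcal{T}^k\|$) is bounded above and below by the expression,

\[ e^{-\frac{k^2}{d/\mathcal{F}}} \;\leq\; \frac{\|H^{-1}(B)\|}{\|\mathcal{G}\|/\|\mathcal{T}^k\|} \;\leq\; e^{\frac{k^2}{d/\mathcal{F}}}. \]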

Proof. Let $w = d/2\mathcal{F}$. Partition $I$ into $w$ disjoint intervals of size $2\mathcal{F}$. Split
each of the intervals into halves, each of size $\mathcal{F}$, where the smaller half is called the
left block and the other the right block. Let $B = (b_1, \ldots, b_k) \in \mathcal{T}^k$ be given. For each
$i$, $1 \leq i \leq k$, the elements in $h^{-1}(b_i)$ are enumerated as

\[ N_{i,1,L} = b_i,\quad N_{i,1,R} = (2\mathcal{F}-1) - b_i,\quad N_{i,2,L} = 2\mathcal{F} + b_i,\quad N_{i,2,R} = (4\mathcal{F}-1) - b_i,\quad \ldots, \]
\[ N_{i,w,L} = (2w-2)\mathcal{F} + b_i,\quad N_{i,w,R} = (2w\mathcal{F}-1) - b_i. \]

Clearly, $N_{i,j,L}$ is in the $j$th left block and $N_{i,j,R}$ in the $j$th right block. So, for
any $A = (a_1, \ldots, a_k) \in \mathcal{G}$, $A \in H^{-1}(B)$ if and only if there exist a sequence
$\sigma = [c_1, \ldots, c_k] \in \{L, R\}^k$ and a nondecreasing sequence $\tau = [j_1, \ldots, j_k] \in \{1, \ldots, w\}^k$
such that

\[ N_{1,j_1,c_1} < \cdots < N_{k,j_k,c_k}. \qquad (3) \]

For every $A = (a_1, \ldots, a_k) \in H^{-1}(B)$, there is a unique $(\sigma, \tau)$ satisfying (3). So,
$\|H^{-1}(B)\|$ is the number of $(\sigma, \tau)$ satisfying (3). Let us fix $\tau$ and then partition $\tau$
into maximal subsequences, each consisting of a single value. Then, for every $\sigma$, $(\sigma, \tau)$
satisfies (3) if and only if

\[ N_{p,j_p,c_p} < \cdots < N_{q,j_q,c_q} \qquad (4) \]

holds for every maximal subsequence $[j_p, \ldots, j_q]$. Note here, for any maximal
subsequence $[j_p, \ldots, j_q]$, that (i) if $p = q$, then (4) holds regardless of $c_p$, (ii) if $p + 1 = q$,
then (4) holds with $c_p = L$ and $c_q = R$, and (iii) if $p < q$ and if (4) holds, then
$[c_p, \ldots, c_q]$ is nondecreasing, where $L$ precedes $R$. Furthermore, we claim that (iv) if
$p < q$, then there are at most 2 choices for $[c_p, \ldots, c_q]$ to satisfy (4). Assume, by way of
contradiction, that there are three such sequences and take the largest and
the smallest of the three. By (iii), each is a run of $L$'s followed by a run of $R$'s, so they are
$[L^{u-p} R^{q-u+1}]$ and $[L^{v-p} R^{q-v+1}]$ for some $u, v$ with $u \leq v - 2$. Positions $u$ and $u+1$
carry $R$ in the first sequence and $L$ in the second. Since (4) holds for both sequences, we have
$N_{u,j_u,L} < N_{u+1,j_{u+1},L}$ and $N_{u,j_u,R} < N_{u+1,j_{u+1},R}$. However, this is impossible because,
with $j_u = j_{u+1}$, $N_{u,j_u,L} < N_{u+1,j_{u+1},L}$ if and only if $b_u < b_{u+1}$, that is, if
and only if $N_{u,j_u,R} > N_{u+1,j_{u+1},R}$.

Now from (iv) it follows that $\|H^{-1}(B)\|$ is maximized if $B$ guarantees the existence
of two $[c_p, \ldots, c_q]$ to satisfy (4) for every maximal subsequence. An example of such
$B$ is $(0, 1, \ldots, k-1)$. For this specific example, $(\sigma, \tau)$ satisfies (3) if and only if for
every maximal subsequence $[j_p, \ldots, j_q]$ of $\tau$, $[c_p, \ldots, c_q]$ is either $L^{q-p+1}$ or $L^{q-p}R$.
This implies that $\|H^{-1}(0, 1, \ldots, k-1)\|$ is equal to the number of ways of choosing a
block from the $2w$ blocks $k$ times under the constraint that each right block can be
chosen at most once. Thus,

\[ \|H^{-1}(0, 1, \ldots, k-1)\| = \sum_{i=0}^{k} \binom{w}{i} \binom{w + k - i - 1}{k - i}. \]

By dividing the right hand side by $\|\mathcal{G}\|/\|\mathcal{T}^k\|$, which is $\binom{2w\mathcal{F}}{k}/\mathcal{F}^k$, we get the upper
bound.

On the other hand, $\|H^{-1}(B)\|$ is minimized if $B$ does not allow us to choose two
$a_i$'s from the same block. An example of such $B$ is $(0, 0, \ldots, 0)$. For this specific
example, $\|H^{-1}(B)\|$ is $\binom{2w}{k}$. By dividing this number by $\|\mathcal{G}\|/\|\mathcal{T}^k\|$, we obtain the ratio
$\prod_{i=0}^{k-1} \frac{2w - i}{2w - i/\mathcal{F}}$, from which we get the lower bound. This proves the theorem. ∎

We obtain, by a similar analysis, the same lower and upper bound for the
interleaved hash function also. However, the two functions behave differently. Note that
the average number of $k$-itemsets per leaf, $\|\mathcal{G}\|/\|\mathcal{T}^k\|$, is approximately $(2w)^k/k!$. Let $\alpha(w)$
denote this polynomial. We say that a leaf has a capacity close to the average if its
capacity, which is a polynomial in $w$ of degree at most $k$, is of the form $(2w)^k/k! + \beta(w)$,
with $\beta(w)$ being a polynomial of degree at most $k - 2$.

For the bitonic hash function, a leaf specified by the hash values $(a_1, \ldots, a_k)$ has
capacity close to $\alpha(w)$ if and only if $a_i \neq a_{i+1}$ for all $i$, $1 \leq i \leq k - 1$. Thus, there are
$\mathcal{F}(\mathcal{F} - 1)^{k-1}$ such leaves, and so a $(1 - 1/\mathcal{F})^{k-1}$ fraction of the leaves have capacity
close to $\alpha(w)$. Note also that, clearly, $(1 - 1/\mathcal{F})^{k-1}$ approaches 1 as the fan-out $\mathcal{F}$ grows.

On the other hand, for the interleaved hash function, a leaf specified by $(a_1, \ldots, a_k)$
has capacity close to $\alpha(w)$ if and only if $a_i \neq a_{i+1}$ for all $i$, and the number of $i$ such
that $a_i < a_{i+1}$ is equal to $(k - 1)/2$. So, there is no such leaf if $k$ is even. For odd
$k \geq 3$, the ratio of the "good" leaves decreases as $k$ increases, achieving a maximum
of 2/3 when $k = 3$. Thus, at most 2/3 of the leaves achieve the average.

From the above discussion it is clear that while both the simple and bitonic hash
functions have the same maximum and minimum bounds, the distribution of the
number of itemsets per leaf is quite different. While a significant portion of the leaves
are close to the average for the bitonic case, only a few are close in the simple hash
function case.
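
The difference between the two hash functions can be checked by brute force on a small instance. The sketch below enumerates every k-itemset, assigns it to a leaf under each hash function, and compares the per-leaf counts with the average. The parameters d = 18, fan-out 3 and k = 2 are illustrative choices (they give w = 3 > k and fan-out > k), the bitonic hash is written as h(i) = i mod F for (i mod 2F) < F and 2F - 1 - (i mod 2F) otherwise, and the bound checked is e^(±k²/(d/F)).

```python
from itertools import combinations, product
from math import comb, exp

def bitonic(i, fanout):
    r = i % (2 * fanout)
    return r if r < fanout else 2 * fanout - 1 - r

def interleaved(i, fanout):
    return i % fanout

def leaf_counts(h, d, fanout, k):
    """Number of k-itemsets over items 0..d-1 that land in each leaf."""
    counts = {leaf: 0 for leaf in product(range(fanout), repeat=k)}
    for itemset in combinations(range(d), k):   # items in increasing order
        counts[tuple(h(a, fanout) for a in itemset)] += 1
    return counts

# Small illustrative parameters: w = d / (2 * fanout) = 3 > k and fanout > k.
d, fanout, k = 18, 3, 2
average = comb(d, k) / fanout ** k
low, high = exp(-k * k / (d / fanout)), exp(k * k / (d / fanout))
for name, h in (("bitonic", bitonic), ("interleaved", interleaved)):
    counts = leaf_counts(h, d, fanout, k)
    ratios = [c / average for c in counts.values()]
    assert all(low <= r <= high for r in ratios)   # both stay within the bounds
    print(name, "min", min(counts.values()), "max", max(counts.values()))
```

For this instance both functions respect the bounds, but the bitonic counts span a narrower range than the interleaved ones, and every bitonic leaf with two distinct hash values holds exactly (2w)^k/k! = 18 itemsets.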


4.2  Short-circuited  Subset  Checking 

Recall that while counting the support, once we reach a leaf node, we check whether
all the itemsets in the leaf are contained in the transaction. This node is then marked
as VISITED to avoid processing it more than once for the same transaction. A
further optimization is to associate a VISITED flag with each node in the hash tree.
We mark an internal node as VISITED the first time we touch it. This enables
us to preempt the search as soon as possible. We would expect this optimization
to be of greatest benefit when the transaction sizes are large. For example, if our
transaction is $T = \{A, B, C, D, E\}$, $k = 3$, fan-out $= 2$, then all the 3-subsets of $T$
are: $\{ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE\}$. Figure 3 shows the
candidate hash tree $C_3$. We have to increment the support of every subset of $T$
contained in $C_3$. We begin with the subset $ABC$, and hash to node 11 and process
all the itemsets. In this downward path from the root we mark nodes 1, 4, and
11 as visited. We then process subset $ABD$, and mark node 10. Now consider the
subset $CDE$. We see in this case that node 1 has already been marked, and we can
preempt the processing at this very stage. This approach can however consume a lot
of memory. For a given fan-out $\mathcal{F}$, for iteration $k$, we need additional memory of size
$\mathcal{F}^k$ to store the flags. In the parallel implementation we have to keep a VISITED
field for each processor, bringing the memory requirement to $P \cdot \mathcal{F}^k$. This can get
very large, especially with an increasing number of processors.

Figure 3: Candidate Hash Tree ($C_3$)

Reducing  Memory  Requirement 

We now present a way to reduce the memory from $\mathcal{F}^k$ to only $k \cdot \mathcal{F}$. This is based
on the observation that we only need to maintain a vector of length $\mathcal{F}$ at each depth
(the number of depths is equal to the current iteration, $k$). Since we generate the $k$-subsets of a
transaction in lexicographical order, once we hash on item $i$, with the hash value
$h(i)$, at depth $l$, we are guaranteed to have looked at all subsets of the transaction
that extend the current prefix and whose item at depth $l$ hashes to the same hash value.
Thus we need only one array entry
for each hash value at a given depth, i.e., we only need an array of length equal to the
fan-out $\mathcal{F}$. Each time we go down a level we draw a new unique flag value from a
simple counter. The flag value remains the same while going up the hash tree. This
process is illustrated in figure 4.

Figure 4: Short Circuiting Subset Checking

Let the transaction $T = \{A, B, C, D, E\}$, $k = 3$,
fan-out $\mathcal{F} = 2$. So we need an array $\mathcal{R}$ of size $k \cdot \mathcal{F} = 3 \cdot 2 = 6$ for our short-circuit
optimization. The hash value of $A$, $C$, and $E$ is 1, while the hash value of $B$ and $D$
is 0. We have to look at all the 3-subsets of $T$ in lexicographical order. The first
subset is $ABC$. We initialize our flag value counter to 1. We hash on item $A$ at the
root of $C_3$ (given in figure 3) at depth 0, and set the array element $\mathcal{R}(0, 1)$ to 1 (note:
for $\mathcal{R}(0, 1)$, the row represents the depth, and the column the hash value). We then
hash on $B$ at depth 1, and set $\mathcal{R}(1, 0)$ to 2. After hashing on $C$ at depth 2, we
mark $\mathcal{R}(2, 1)$ with 3. At this point we are at node 11 in $C_3$. We process all the
itemsets in its children (the leaf node). Still at depth 2, we look at the next item in the
transaction, $D$, and mark $\mathcal{R}(2, 0)$ with 3. However, when we try to process $ABE$, we
see that $\mathcal{R}(2, 1)$ has already been marked, so we move to the next subset $ACD$. We
have to move up a depth, i.e., go back to depth 1 and look at the item $C$, and
mark $\mathcal{R}(1, 1)$ with 2. When we descend to depth 2, we invoke a new flag value 4, and
mark $\mathcal{R}(2, 0)$ with 4. This process continues till all subsets have been processed. To
realize the effectiveness of this optimization, consider the case of $CDE$. In this case
we can pre-empt the search at depth 0, since all hash values have been visited (see
the last step in figure 4). For the parallel version we keep a local copy of a $k \cdot \mathcal{F}$ sized
array for each processor (total $k \cdot \mathcal{F} \cdot P$). This approach avoids the problem of false
sharing, and also conserves memory.
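
The flag-array scheme can be made concrete with a short sketch. It is a simplified model rather than the actual implementation: no real hash tree is built, each leaf is identified by its path of hash values, and the function name `visited_leaves` is illustrative. Rows of the flag array are indexed by depth starting at 0 for the root.

```python
from itertools import count

def visited_leaves(transaction, k, fanout, h):
    """Return the hash paths of the leaves reached for one transaction,
    using a k x fanout flag array instead of one flag per tree node."""
    flags = [[0] * fanout for _ in range(k)]   # R(depth, hash value); 0 = unmarked
    fresh = count(1)                            # simple counter of unique flag values
    leaves = []

    def descend(depth, start, flag, path):
        # Leave enough items after position `pos` to complete a k-subset.
        last = len(transaction) - (k - depth)
        for pos in range(start, last + 1):
            value = h(transaction[pos])
            if flags[depth][value] == flag:
                continue                        # this branch was already explored
            flags[depth][value] = flag
            if depth == k - 1:
                leaves.append(path + (value,))  # the leaf's itemsets would be checked here
            else:
                # Going down a level draws a new flag; on return the old one is in use again.
                descend(depth + 1, pos + 1, next(fresh), path + (value,))

    descend(0, 0, next(fresh), ())
    return leaves

hash_value = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 1}
leaves = visited_leaves("ABCDE", 3, 2, hash_value.get)
```

For the example transaction the ten 3-subsets lead to only seven distinct leaves: the descents for ABE, ADE and CDE are cut short at depths 2, 1 and 0 respectively, while the flag storage stays at k times the fan-out (6 entries) per processor.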

5 Experimental Evaluation

5.1  Experimental  Setup 


Database         |T|   |I|   |D|         Total Size
T5.I2.D100K      5     2     100,000     2.6MB
T10.I4.D100K     10    4     100,000     4.3MB
T15.I4.D100K     15    4     100,000     6.2MB
T20.I6.D100K     20    6     100,000     7.9MB
T10.I6.D400K     10    6     400,000     17.1MB
T10.I6.D800K     10    6     800,000     34.6MB
T10.I6.D1600K    10    6     1,600,000   69.8MB

Table 2: Database properties


All the experiments were performed on a 12-node SGI Power Challenge shared-memory
multiprocessor. Each node is a MIPS processor running at 100MHz. There is
a total of 256MB of main memory. The primary cache size is 16 KB (64-byte cache
line size), with separate instruction and data caches, while the secondary cache is 1
MB (128-byte cache line size). The databases are stored on an attached 2GB disk.
All processors run IRIX 5.3, and data is obtained from the disk via an NFS file server.


Figure 5: Large Itemsets per Iteration (x-axis: Iterations)

We used different synthetic databases with sizes ranging from 3MB to 70MB,
generated using the procedure described in [4]. These databases mimic the transactions
in a retailing environment. Each transaction has a unique ID followed by a list of
items bought in that transaction. The data mining provides information about the
set of items generally bought together. Table 2 shows the databases used and their
properties. The number of transactions is denoted as $|D|$, the average transaction size
as $|T|$, and the average maximal potentially large itemset size as $|I|$. The number of
maximal potentially large itemsets is $|L| = 2000$, and the number of items is $N = 1000$.
We refer the reader to [4] for more detail on the database generation. All the
experiments were performed with a minimum support value of 0.5%, and a leaf threshold of
2 (i.e., a maximum of 2 itemsets per leaf). Figure 5 shows the number of iterations and the
number of large itemsets found for different databases. In the following sections all
the results are reported for the CCPD parallelization. We do not present any results
for the PCCD approach since it performs very poorly, and results in a speed-down
on more than one processor.
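
The database names in Table 2 encode the generation parameters, and the relative minimum support translates into an absolute transaction count. The following sketch illustrates both. The helper names `parse_database_name` and `min_support_count` are hypothetical, and rounding the threshold up is an assumption rather than a detail stated here.

```python
import math
import re

def parse_database_name(name):
    """Split a name such as 'T10.I6.D800K' into (|T|, |I|, |D|)."""
    match = re.fullmatch(r"T(\d+)\.I(\d+)\.D(\d+)K", name)
    if match is None:
        raise ValueError(f"unrecognized database name: {name}")
    avg_transaction, avg_itemset, thousands = map(int, match.groups())
    return avg_transaction, avg_itemset, thousands * 1000

def min_support_count(num_transactions, min_support_percent=0.5):
    """Smallest number of transactions an itemset must appear in."""
    return math.ceil(num_transactions * min_support_percent / 100)

for name in ("T5.I2.D100K", "T10.I6.D1600K"):
    t, i, d = parse_database_name(name)
    print(name, t, i, d, min_support_count(d))
```

With a minimum support of 0.5%, an itemset must occur in at least 500 of the 100,000 transactions of the smaller databases, and in at least 8,000 transactions of T10.I6.D1600K.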

5.2  Computation  and  Hash  Tree  Balancing 


Figure  6:  Effect  of  Computation  and  Hash  Tree  Balancing 


Figure 6 shows the improvement in the performance obtained by applying the
computation balancing optimization (discussed in section 3.1), and the hash tree balancing
optimization (described in section 4.1). The figure shows the % improvement over
a run without any optimizations. Results are presented for different databases and
on different numbers of processors. We first consider only the computation balancing
optimization (COMP) using the multiple equivalence classes algorithm. As expected
this does not improve the execution time for the uni-processor case, as there is nothing
to balance. However, it is very effective on multiple processors. We get an
improvement of around 20% on 8 processors. The second column (for all processors) shows
the benefit of just balancing the hash tree (TREE) using our bitonic hashing (the
unoptimized version uses the simple mod interleaved hash function). Hash tree
balancing by itself is an extremely effective optimization. It improves the performance
by about 30% even on uni-processors. On smaller databases and 8 processors, however,
it is not as good as the COMP optimization. The reason is that hash tree balancing
is not sufficient to offset the inherent load imbalance in the candidate generation in this
case. The most effective approach is to apply both optimizations at the same time
(COMP-TREE). The combined effect is sufficient to push the improvements into the
40% range in the multiple-processor case. On 1 processor only hash tree balancing is
beneficial, since computation balancing only adds extra cost.


5.3  Short-circuited  Subset  Checking 


Figure 7: Effect of Short-circuited Subset Checking (x-axis: # procs across databases T5.I2.D100K, T10.I6.D800K, T15.I4.D100K, T20.I6.D100K)


Figure 7 shows the improvement due to the short-circuited subset checking
optimization with respect to the unoptimized version (in the unoptimized version only the
leaves are marked as visited, to avoid counting an itemset more than once). The
results are presented for different numbers of processors across different databases.
The results indicate that while there is some improvement for databases with small
transaction sizes, the optimization is most effective when the transaction size is large.
In this case we get improvements of around 25% over the unoptimized version.

To gain further insight into this optimization, consider figure 8. It shows the
percentage improvement obtained per iteration on applying this optimization on the
T20.I6.D100K database. It shows results only for the uni-processor case; however,
similar results were obtained on more processors. We observe that as the iteration $k$
increases, there is more opportunity for short-circuiting the subset checking, and we
get increasing benefits of up to 60%. The improvements start to fall off at the high
end where the number of candidates becomes small, resulting in a small hash tree and
less opportunity for short-circuiting. It becomes clear that this is an extremely effective
optimization for larger transaction sizes, and in cases where there is a large number
of candidate $k$-itemsets.

Figure 8: % Improvement per Iteration (# proc = 1), T20.I6.D100K (x-axis: Iterations, 2 to 12)

5.4  Common  Candidate  Hash  Tree;  Contention 


Figure 9: % Improvement with NOLCK: Upper bound on Contention (x-axis: databases T5.I2.D100K, T10.I4.D100K, T15.I4.D100K, T20.I6.D100K, T10.I6.D400K, T10.I6.D800K, T10.I6.D1600K, each with # procs = 2, 4, 8, 12)

Recall our discussion in section 3.2 on keeping a common hash tree among the
processors, with a support counter per itemset. Mutual exclusion for incrementing the
counter is provided by a locking mechanism at the itemset level. Here we study the
amount of potential contention that may be caused in this approach. This is done by
eliminating the locks altogether (NOLCK) and measuring the execution time. Figure
9 shows the % difference in execution time of NOLCK with respect to the version with
the locking mechanism for each itemset. Note that the NOLCK time also eliminates
the extra instructions to acquire and release the locks, and thus represents an upper
bound on the amount of contention. For all the databases and processors we get a
difference within 4%, which leads us to conclude that contention is not a big problem
due to the fine-grained locking of the support counters at the itemset level.
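
Itemset-level locking can be sketched as follows. This is an illustration under stated assumptions, not the measured implementation: a dictionary stands in for the shared candidate hash tree, the class and function names are hypothetical, and Python threads are used only to show the locking discipline, not to reproduce the timings.

```python
import threading
from itertools import combinations

class Candidate:
    """A candidate itemset with its own support counter and lock."""
    def __init__(self, items):
        self.items = items
        self.support = 0
        self.lock = threading.Lock()

def count_partition(transactions, table, k):
    """Count support for one database partition against the shared table."""
    for t in transactions:
        for subset in combinations(sorted(t), k):
            cand = table.get(subset)
            if cand is not None:
                with cand.lock:          # itemset-level mutual exclusion
                    cand.support += 1

def ccpd_count(partitions, candidates, k):
    """One thread per database partition, all updating the same counters."""
    table = {c: Candidate(c) for c in candidates}   # shared structure, read-only shape
    threads = [threading.Thread(target=count_partition, args=(part, table, k))
               for part in partitions]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return {c: cand.support for c, cand in table.items()}

partitions = [[{"A", "B", "C"}, {"A", "C"}], [{"A", "B", "C", "D"}, {"B", "D"}]]
support = ccpd_count(partitions, [("A", "B"), ("A", "C"), ("B", "D")], 2)
```

Because each lock guards a single counter, two threads block each other only when they update the same itemset at the same moment, which is the intuition behind the low contention reported above.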

5.5  Parallel  Performance 

Figure 10: CCPD: Speed-up a) without reading time b) with reading time

Figure 10 presents the speedups obtained on different databases and different
numbers of processors for the CCPD parallelization. The results presented on CCPD use all the
optimizations studied above: computation balancing, hash tree balancing and
short-circuited subset checking. The figure on the left presents the speed-up without taking
the initial database reading time into account. We observe that as the number of
transactions increases we get increasing speed-up, with a speed-up of more than 8 on
12 processors for the largest database (T10.I6.D1600K, with 1.6 million transactions).
However, if we were to account for the database reading time, then we get a speed-up
of only 4 on 12 processors.

Table 3 shows the total time spent reading the database, and the percentage of
total time this constitutes on different numbers of processors. The results indicate
that on 12 processors 60% or more of the total time can be spent just on I/O (66.4% for
the largest database). This suggests
a great need for parallel I/O techniques for effective parallelization of data mining
applications, since by their very nature data mining algorithms must operate on large
amounts of data.

                 Reading    % of Total Time
Database         Time       P = 1   P = 2   P = 4   P = 8   P = 12
T5.I2.D100K      9.1s       39.9    43.8    52.6    56.8    59.0
T10.I4.D100K     13.7s      15.6    22.2    29.3    36.6    39.8
T15.I4.D100K     18.9s      8.9     14.0    21.6    29.2    32.8
T20.I6.D100K     24.1s      4.9     8.1     12.8    18.6    22.4
T10.I6.D400K     55.2s      16.8    24.7    36.4    48.0    53.8
T10.I6.D800K     109.0s     19.0    29.8    43.0    56.0    62.9
T10.I6.D1600K    222.0s     19.4    28.6    44.9    59.4    66.4

Table 3: Database Reading Time
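
The effect of the reading time on the speed-up can be estimated directly from Table 3. The sketch below assumes that the reading time is the same on any number of processors and that the percentages are relative to the total time including reading; the function name `speedups` is illustrative.

```python
def speedups(read_time, pct_read_p1, pct_read_pn):
    """Speed-up on n processors with and without the reading time,
    derived from the share of total time spent reading."""
    total_1 = read_time / (pct_read_p1 / 100)   # total time on 1 processor
    total_n = read_time / (pct_read_pn / 100)   # total time on n processors
    with_reading = total_1 / total_n
    without_reading = (total_1 - read_time) / (total_n - read_time)
    return without_reading, with_reading

# T10.I6.D1600K row of Table 3: 222.0s reading, 19.4% at P = 1, 66.4% at P = 12.
without_reading, with_reading = speedups(222.0, 19.4, 66.4)
print(round(without_reading, 1), round(with_reading, 1))   # 8.2 3.4
```

Under these assumptions the estimate is a speed-up of about 8.2 on the computation alone and about 3.4 once the reading time is included, which is broadly in line with the speed-ups reported in section 5.5. It should be read as an approximation only, since the table percentages are rounded.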


6  Conclusions 

In  this  paper,  we  presented  a  parallel  implementation  of  the  Apriori  algorithm  on 
the  SGI  Power  Challenge  shared  memory  multi-processor.  We  also  discussed  a  set 
of  optimizations  which  include  optimized  join  and  pruning,  computation  balancing 
for  candidate  generation,  hash  tree  balancing,  and  short-circuited  subset  checking. 
We  then  presented  experimental  results  on  each  of  these.  Improvements  of  more  than 
40%  were  obtained  for  the  computation  and  hash  tree  balancing.  The  short-circuiting 
optimization  was  found  to  be  extremely  effective  for  databases  with  large  transaction 
sizes.  Finally  we  reported  the  parallel  performance  of  the  algorithm.  While  we 
achieved  good  speed-up,  we  observed  a  need  for  parallel  I/O  techniques  for  further 
performance  gains. 